Lecture 9: August 25th, 2023#
Reminders:
All EDA outcome quizzes have been posted. Attempt the ones you’re missing, and come to student hours if any issues come up! Anthony and I are here to help.
“50 years of data science” token-earning assignment due tonight at midnight. As always, this is optional.
I’m almost done writing the new homeworks for next week and they will be uploaded by tonight. They will be due Week 4 Friday at midnight instead of Wednesday.
Coming up:
On Monday, we’ll go through the instructions for the final project.
The planning worksheet for the final project will be due during Week 5.
Today:
We’ll introduce Machine Learning (ML)
We’ll start by coding for linear regression
Anthony will go through a worksheet on generating data for regression problems. Definitely go, if you are able to!
Introduction to Machine Learning#
Let’s take another field trip…to the iPad!
Performing Linear Regression Using scikit-learn#
import pandas as pd
import altair as alt
import seaborn as sns
Import the taxis data from Seaborn.
df = sns.load_dataset("taxis")
df.sample(5)
| pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1835 | 2019-03-03 13:23:27 | 2019-03-03 13:35:31 | 1 | 1.67 | 10.0 | 0.0 | 0.0 | 13.3 | yellow | cash | Central Park | Times Sq/Theatre District | Manhattan | Manhattan |
| 3122 | 2019-03-21 01:03:46 | 2019-03-21 01:11:46 | 1 | 2.30 | 10.0 | 0.0 | 0.0 | 13.8 | yellow | cash | Lower East Side | Williamsburg (North Side) | Manhattan | Brooklyn |
| 4177 | 2019-03-27 15:31:25 | 2019-03-27 15:48:19 | 1 | 2.03 | 12.0 | 0.0 | 0.0 | 15.3 | yellow | cash | Penn Station/Madison Sq West | Midtown East | Manhattan | Manhattan |
| 5313 | 2019-03-11 10:48:05 | 2019-03-11 11:20:00 | 1 | 2.20 | 19.0 | 0.0 | 0.0 | 22.3 | yellow | cash | Midtown Center | West Chelsea/Hudson Yards | Manhattan | Manhattan |
| 1824 | 2019-03-14 21:20:40 | 2019-03-14 21:34:04 | 1 | 2.81 | 11.5 | 1.7 | 0.0 | 17.0 | yellow | credit card | Midtown Center | Yorkville West | Manhattan | Manhattan |
Drop rows with missing values
df = df.dropna()
Using Altair, make a scatter plot with “fare” on the y-axis and with “distance” on the x-axis.
alt.Chart(df).mark_circle().encode(
x="distance",
y="fare"
)
---------------------------------------------------------------------------
MaxRowsError Traceback (most recent call last)
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/v5/api.py:2520, in Chart.to_dict(self, *args, **kwargs)
2518 copy.data = core.InlineData(values=[{}])
2519 return super(Chart, copy).to_dict(*args, **kwargs)
-> 2520 return super().to_dict(*args, **kwargs)
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/v5/api.py:838, in TopLevelMixin.to_dict(self, *args, **kwargs)
836 copy = self.copy(deep=False) # type: ignore[attr-defined]
837 original_data = getattr(copy, "data", Undefined)
--> 838 copy.data = _prepare_data(original_data, context)
840 if original_data is not Undefined:
841 context["data"] = original_data
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/v5/api.py:100, in _prepare_data(data, context)
98 # convert dataframes or objects with __geo_interface__ to dict
99 elif isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
--> 100 data = _pipe(data, data_transformers.get())
102 # convert string input to a URLData
103 elif isinstance(data, str):
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
608 """ Pipe a value through a sequence of functions
609
610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
(...)
625 thread_last
626 """
627 for func in funcs:
--> 628 data = func(data)
629 return data
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
302 def __call__(self, *args, **kwargs):
303 try:
--> 304 return self._partial(*args, **kwargs)
305 except TypeError as exc:
306 if self._should_curry(args, kwargs, exc):
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/vegalite/data.py:19, in default_data_transformer(data, max_rows)
17 @curried.curry
18 def default_data_transformer(data, max_rows=5000):
---> 19 return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
608 """ Pipe a value through a sequence of functions
609
610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
(...)
625 thread_last
626 """
627 for func in funcs:
--> 628 data = func(data)
629 return data
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
302 def __call__(self, *args, **kwargs):
303 try:
--> 304 return self._partial(*args, **kwargs)
305 except TypeError as exc:
306 if self._should_curry(args, kwargs, exc):
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/altair/utils/data.py:82, in limit_rows(data, max_rows)
80 values = data
81 if max_rows is not None and len(values) > max_rows:
---> 82 raise MaxRowsError(
83 "The number of rows in your dataset is greater "
84 f"than the maximum allowed ({max_rows}).\n\n"
85 "See https://altair-viz.github.io/user_guide/large_datasets.html "
86 "for information on how to plot large datasets, "
87 "including how to install third-party data management tools and, "
88 "in the right circumstance, disable the restriction"
89 )
90 return data
MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000).
See https://altair-viz.github.io/user_guide/large_datasets.html for information on how to plot large datasets, including how to install third-party data management tools and, in the right circumstance, disable the restriction
alt.Chart(...)
Here, we get a MaxRowsError; by default, Altair only works with data that has at most 5000 rows.
Choose 5000 random rows to avoid the MaxRowsError.
Let’s get a random selection of 5000 rows from df. I’m not going to worry about getting reliable random rows; the point of this part is just to get a feel for what the data looks like.
alt.Chart(df.sample(5000)).mark_circle().encode(
x="distance",
y="fare"
)
Looking at the data, it seems to be roughly linear. It’s not perfectly linear, but we should be able to approximate a line pretty well. The only weird thing is that horizontal line…let’s see what’s going on there by adding a tooltip.
James brought up a great point: some of the rides go a distance of zero miles…and are still charged. Let’s remove these points from our data, because this seems very strange.
alt.Chart(df.sample(5000)).mark_circle().encode(
x="distance",
y="fare",
tooltip=["dropoff_zone","pickup_zone","fare","distance"]
)
df2 = df.sample(5000,random_state=10)
df2 = df2[df2["distance"] > 0]
alt.Chart(df2).mark_circle().encode(
x="distance",
y="fare",
tooltip=["dropoff_zone","pickup_zone","fare","distance"]
)
The horizontal line consists entirely of rides going to or from an airport. This looks like some kind of fixed-price promotion where you can go to the airport (or get picked up from the airport) and travel anywhere within a region for a fixed price.
What would you estimate is the slope of the “line of best fit” for this data?
We have the points \((0.02,2.5)\) and \((5,16)\)
#The slope
(16-2.5)/(5-0.02)
2.710843373493976
If I had to approximate the line, I’d say the slope is about 2.71.
There is a routine in scikit-learn that we will see many times! Starting now!
1.) Import
2.) Instantiate (create an instance of an object from an appropriate class)
3.) Fit
4.) Predict
Find this slope using the LinearRegression class from scikit-learn.
#1.) import
from sklearn.linear_model import LinearRegression
Create a LinearRegression object and name it reg (for regression)
#2.) Instantiate
reg = LinearRegression()
type(reg)
sklearn.linear_model._base.LinearRegression
We see reg is a linear regression object. It is not from base Python; it belongs to scikit-learn.
Below, let’s try to fit the data. We’re going to get an error, and I can say that you will most likely run into this error many times on your own.
#3.) Fit
reg.fit(df2["distance"],df2["fare"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [14], in <cell line: 2>()
1 #3.) Fit
----> 2 reg.fit(df2["distance"],df2["fare"])
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/sklearn/base.py:1151, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1144 estimator._validate_params()
1146 with config_context(
1147 skip_parameter_validation=(
1148 prefer_skip_nested_validation or global_skip_validation
1149 )
1150 ):
-> 1151 return fit_method(estimator, *args, **kwargs)
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/sklearn/linear_model/_base.py:678, in LinearRegression.fit(self, X, y, sample_weight)
674 n_jobs_ = self.n_jobs
676 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 678 X, y = self._validate_data(
679 X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
680 )
682 has_sw = sample_weight is not None
683 if has_sw:
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/sklearn/base.py:621, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
619 y = check_array(y, input_name="y", **check_y_params)
620 else:
--> 621 X, y = check_X_y(X, y, **check_params)
622 out = X, y
624 if not no_val_X and check_params.get("ensure_2d", True):
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/sklearn/utils/validation.py:1147, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1142 estimator_name = _check_estimator_name(estimator)
1143 raise ValueError(
1144 f"{estimator_name} requires y to be passed, but the target y is None"
1145 )
-> 1147 X = check_array(
1148 X,
1149 accept_sparse=accept_sparse,
1150 accept_large_sparse=accept_large_sparse,
1151 dtype=dtype,
1152 order=order,
1153 copy=copy,
1154 force_all_finite=force_all_finite,
1155 ensure_2d=ensure_2d,
1156 allow_nd=allow_nd,
1157 ensure_min_samples=ensure_min_samples,
1158 ensure_min_features=ensure_min_features,
1159 estimator=estimator,
1160 input_name="X",
1161 )
1163 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
1165 check_consistent_length(X, y)
File ~/opt/miniconda3/envs/math9/lib/python3.9/site-packages/sklearn/utils/validation.py:940, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
938 # If input is 1D raise error
939 if array.ndim == 1:
--> 940 raise ValueError(
941 "Expected 2D array, got 1D array instead:\narray={}.\n"
942 "Reshape your data either using array.reshape(-1, 1) if "
943 "your data has a single feature or array.reshape(1, -1) "
944 "if it contains a single sample.".format(array)
945 )
947 if dtype_numeric and hasattr(array.dtype, "kind") and array.dtype.kind in "USV":
948 raise ValueError(
949 "dtype='numeric' is not compatible with arrays of bytes/strings."
950 "Convert your data to numeric values explicitly instead."
951 )
ValueError: Expected 2D array, got 1D array instead:
array=[2.8 1.2 2.1 ... 2.68 1.6 1.47].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
What goes wrong here is that reg.fit expects a two-dimensional array for the input, but we passed the pandas Series df2["distance"]. We should think of a pandas Series as a one-dimensional object.
df2["distance"].shape
(4972,)
Notice the blank after the comma when we call shape. This is letting us know that the pandas Series is one-dimensional.
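If you ever have a one-dimensional Series and need a two-dimensional input, a couple of conversions work. These are general pandas/NumPy patterns, shown here on a small stand-in Series rather than the taxis data:

```python
import pandas as pd

s = pd.Series([2.8, 1.2, 2.1])  # stand-in for df2["distance"]

# Option 1: to_frame() turns the Series into a one-column DataFrame
print(s.to_frame().shape)              # (3, 1)

# Option 2: reshape the underlying NumPy array, as the error message suggests
print(s.to_numpy().reshape(-1, 1).shape)  # (3, 1)
```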
Observe the difference with the following:
df2[["distance"]]
| distance | |
|---|---|
| 2871 | 2.80 |
| 898 | 1.20 |
| 845 | 2.10 |
| 1580 | 3.35 |
| 4002 | 10.70 |
| ... | ... |
| 1812 | 1.20 |
| 2191 | 13.11 |
| 4827 | 2.68 |
| 4326 | 1.60 |
| 5779 | 1.47 |
4972 rows Ă— 1 columns
df2[["distance"]].shape
(4972, 1)
The example above is treated as a DataFrame with just one column. This is what happens when we index with a list, df2[[...]].
One way to remember when we need two dimensions versus one dimension is the capitalization convention: a capital “X” means we need two dimensions, while a lower-case “y” means we need a single dimension.
reg.fit(df2[["distance"]],df2["fare"])
LinearRegression()
At this point, reg has done all of the hard work of finding a linear equation that approximates our data (“fare” as a linear function of “distance”.)
Recall: The original question was asking us to find the slope. Here’s how we can get it:
The slope is stored as the coef_ attribute.
reg.coef_
array([2.72848668])
Notice that this is a NumPy array. If I wanted to extract just the number, I could do this:
reg.coef_[0]
2.7284866819996245
We had estimated before that the slope would be about 2.71, so I think we did a pretty good job :)
Find the intercept.
The intercept is stored as the intercept_ attribute.
reg.intercept_
4.660714229453321
Putting these together, the equation of our line is given by:

$$\text{fare} \approx 2.7284866819996245 \cdot (\text{distance}) + 4.660714229453321$$
Good Question from the Chat: Why does reg.intercept_ not give you an array?
Answer: It has to do with the form of the model. In our case, we trained on just one input: distance. So our model looks like what we wrote above. But we don’t have to consider distance by itself; we could also train on distance, number of passengers, and the hour of the taxi ride. If we train on those three variables, we get 3 distinct coefficients, and they are returned in a NumPy array. The intercept, on the other hand, is always a single number.
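As a sketch of that idea (using made-up data rather than the taxis DataFrame): training on three input columns produces a coef_ array with three entries, while intercept_ stays a single number.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 3))   # three made-up features, e.g. distance, passengers, hour
y = 2 * X[:, 0] + 0.5 * X[:, 1] - X[:, 2] + 4

reg3 = LinearRegression()
reg3.fit(X, y)

print(reg3.coef_.shape)        # (3,) -- one coefficient per input column
print(np.ndim(reg3.intercept_))  # 0 -- the intercept is a single number, not an array
```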
What are the predicted outputs for the first 5 rows? What are the actual outputs?
df2[:5]
| pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2871 | 2019-03-12 20:28:02 | 2019-03-12 20:43:16 | 1 | 2.80 | 12.0 | 3.15 | 0.00 | 18.95 | yellow | credit card | Upper East Side South | East Village | Manhattan | Manhattan |
| 898 | 2019-03-24 13:17:38 | 2019-03-24 13:31:41 | 1 | 1.20 | 10.0 | 2.65 | 0.00 | 15.95 | yellow | credit card | Murray Hill | Clinton East | Manhattan | Manhattan |
| 845 | 2019-03-04 13:22:23 | 2019-03-04 13:38:07 | 1 | 2.10 | 11.5 | 2.96 | 0.00 | 17.76 | yellow | credit card | Midtown East | Upper West Side South | Manhattan | Manhattan |
| 1580 | 2019-03-21 23:31:03 | 2019-03-21 23:42:56 | 1 | 3.35 | 12.0 | 3.16 | 0.00 | 18.96 | yellow | credit card | Kips Bay | Lincoln Square East | Manhattan | Manhattan |
| 4002 | 2019-03-16 08:55:35 | 2019-03-16 09:37:31 | 3 | 10.70 | 39.0 | 9.10 | 5.76 | 54.66 | yellow | credit card | Manhattan Valley | LaGuardia Airport | Manhattan | Queens |
Notice, we have a distance of 2.8 and a fare of 12. The model will predict the following for a distance of 2.8:
reg.coef_*2.8 + reg.intercept_
array([12.30047694])
reg.predict(df2[:5][["distance"]])
array([12.30047694, 7.93489825, 10.39053626, 13.80114461, 33.85552173])
`reg.fit` is still a little mysterious, but `reg.predict` is not: it just evaluates our linear function at the given distances.
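We can check that claim directly. On some hypothetical one-feature data, predict agrees with coef_ * x + intercept_:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up one-feature data, not the taxis DataFrame
X = np.array([[1.0], [2.0], [4.0]])
y = np.array([7.0, 9.5, 15.0])

reg = LinearRegression().fit(X, y)

# evaluate the fitted line by hand and compare to predict
manual = reg.coef_[0] * X[:, 0] + reg.intercept_
print(np.allclose(reg.predict(X), manual))   # True
```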
Interpreting Linear Regression Coefficients#
Add a new column to the DataFrame, called “hour”, which contains the hour at which the pickup occurred.
df2.columns
Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
'pickup_borough', 'dropoff_borough'],
dtype='object')
df2.dtypes
pickup datetime64[ns]
dropoff datetime64[ns]
passengers int64
distance float64
fare float64
tip float64
tolls float64
total float64
color object
payment object
pickup_zone object
dropoff_zone object
pickup_borough object
dropoff_borough object
dtype: object
df2["hour"] = df2["pickup"].dt.hour
df2.head()
| pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough | hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2871 | 2019-03-12 20:28:02 | 2019-03-12 20:43:16 | 1 | 2.80 | 12.0 | 3.15 | 0.00 | 18.95 | yellow | credit card | Upper East Side South | East Village | Manhattan | Manhattan | 20 |
| 898 | 2019-03-24 13:17:38 | 2019-03-24 13:31:41 | 1 | 1.20 | 10.0 | 2.65 | 0.00 | 15.95 | yellow | credit card | Murray Hill | Clinton East | Manhattan | Manhattan | 13 |
| 845 | 2019-03-04 13:22:23 | 2019-03-04 13:38:07 | 1 | 2.10 | 11.5 | 2.96 | 0.00 | 17.76 | yellow | credit card | Midtown East | Upper West Side South | Manhattan | Manhattan | 13 |
| 1580 | 2019-03-21 23:31:03 | 2019-03-21 23:42:56 | 1 | 3.35 | 12.0 | 3.16 | 0.00 | 18.96 | yellow | credit card | Kips Bay | Lincoln Square East | Manhattan | Manhattan | 23 |
| 4002 | 2019-03-16 08:55:35 | 2019-03-16 09:37:31 | 3 | 10.70 | 39.0 | 9.10 | 5.76 | 54.66 | yellow | credit card | Manhattan Valley | LaGuardia Airport | Manhattan | Queens | 8 |
Remove all rows from the DataFrame where the hour is 16 or earlier. (So we are only using late afternoon and evening taxi rides.)
That’s all we got to today! We’ll pick back up on Monday.
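For the filtering exercise above, one possible sketch (assuming the DataFrame already has the “hour” column) is a boolean mask, shown here on a small made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({"hour": [8, 13, 20, 23], "fare": [10.0, 11.5, 12.0, 12.0]})

# keep only rides strictly later than hour 16 (i.e. 5pm onward)
late = toy[toy["hour"] > 16]
print(late["hour"].tolist())   # [20, 23]
```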
Add a new column to the DataFrame, called “duration”, which contains the amount of time in minutes of the taxi ride.
Hint 1. Because the “dropoff” and “pickup” columns are already date-time values, we can subtract one from the other and pandas will know what to do.
Hint 2. I expected there to be a minutes attribute (after using the dt accessor) but there wasn’t. Call dir to see some options.
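Following those hints, one possible sketch (on made-up timestamps): subtracting the datetime columns gives a Timedelta Series, and total_seconds() through the dt accessor converts it to minutes.

```python
import pandas as pd

toy = pd.DataFrame({
    "pickup":  pd.to_datetime(["2019-03-12 20:28:02", "2019-03-24 13:17:38"]),
    "dropoff": pd.to_datetime(["2019-03-12 20:43:16", "2019-03-24 13:31:41"]),
})

# subtracting datetimes gives a Timedelta Series; convert seconds to minutes
toy["duration"] = (toy["dropoff"] - toy["pickup"]).dt.total_seconds() / 60
print(toy["duration"].round(2).tolist())   # [15.23, 14.05]
```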
Fit a new LinearRegression object, this time using “distance”, “hour”, and “passengers” as the input features, and using “duration” as the target value.